feat: add kep-98:ResourceScalingGroup (RSG) - Advanced Scaling API for Distributed AI Inference #120
Open
Mag-FelixFelicis wants to merge 1 commit into
cheyang (Collaborator) reviewed on Jun 11, 2026 and left a comment:
Thanks for putting this KEP together. The motivation around multi-role coupled scaling and non-intrusive adoption of existing workloads is compelling. A few things I'd like to see addressed before this is ready:
API Group & Project Alignment
- The proposed `resourcescalinggroup.com/v1` API group doesn't align with the project's existing convention (`workloads.x-k8s.io`). Is there a reason RSG lives in a separate API group? If this is intended to be part of the RBG ecosystem, it should probably be under the same group.
- The relationship between RSG and RBGSA (RoleBasedGroupScalingAdapter) needs to be spelled out explicitly. Does RSG supersede RBGSA for these use cases, or do they coexist? What's the migration path?
Type Definition Inconsistencies
- `Ratio` is declared as `int32` in the Go struct, but the comment says "Using string to support decimals safely in JSON (parsed as float64 in controller)." The YAML examples use `1.0` and `2.0`. Pick one representation and make it consistent. If you need fractional ratios (e.g., 0.25), `int32` won't work.
- In `resourcescalinggroup.yaml`, the second example has `roleName: "Decode"` (capitalized) but the KEP body uses `roleName: "decode"` (lowercase). Which is canonical?
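To make the `Ratio` point concrete, here is a minimal sketch of the string-based option, assuming the field is carried as a string on the wire and parsed in the controller. The helper names `parseRatio` and `scaledReplicas`, and the round-up policy, are hypothetical and not taken from the KEP.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// parseRatio converts a string-encoded ratio (assumed wire format) into a
// float64 and rejects values that are not positive, finite numbers.
func parseRatio(s string) (float64, error) {
	r, err := strconv.ParseFloat(s, 64)
	if err != nil {
		return 0, fmt.Errorf("invalid ratio %q: %w", s, err)
	}
	if math.IsNaN(r) || math.IsInf(r, 0) || r <= 0 {
		return 0, fmt.Errorf("ratio %q must be a positive finite number", s)
	}
	return r, nil
}

// scaledReplicas derives a dependent role's replica count from a base count.
// Rounding up is an assumption; the KEP would need to state its own policy.
func scaledReplicas(base int32, ratio float64) int32 {
	return int32(math.Ceil(float64(base) * ratio))
}

func main() {
	for _, s := range []string{"2.0", "0.25", "abc"} {
		r, err := parseRatio(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("ratio=%s base=4 -> %d replicas\n", s, scaledReplicas(4, r))
	}
}
```

With an `int32` field, the `0.25` case above could not be expressed at all, which is why the type and the comment need to agree.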
Underspecified Mechanisms
- HPA integration: Status has a `LabelSelector` field "used by HPA to discover the resource," implying RSG implements the Scale subresource. This is a critical integration point that deserves its own section: how does HPA target RSG? What does the `/scale` endpoint look like?
- GroupReplication scale-out trigger: Who/what increments `replicas`? If HPA drives it, the Scale subresource contract is essential. If it's manual, say so explicitly.
- Name resolution in InplaceScaling bindings: In the second example, bindings reference `prefill` and `decode` by name, but those are `subResources` of `rolebasedgroup1`, not top-level entries in `targetResources`. How does the controller resolve this? The indirection needs documenting.
- Scale-down candidate discovery: How are group IDs (e.g., `inference-unit-02`) assigned, tracked, and communicated to external systems like traffic gateways? Without this, the "zero-downtime" story is incomplete.
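For the HPA question, one common way to expose a Scale subresource on a CRD is the kubebuilder `subresource:scale` marker. The sketch below assumes RSG would use `.spec.replicas`, `.status.replicas`, and `.status.labelSelector`; those paths, the trimmed-down types, and the `desiredReplicas` helper are illustrative assumptions, and the usual `metav1` type and object metadata are omitted to keep the example self-contained.

```go
package main

import "fmt"

// ResourceScalingGroupSpec is a hypothetical, trimmed-down spec.
type ResourceScalingGroupSpec struct {
	// Replicas is the field a write to /scale would update.
	Replicas *int32 `json:"replicas,omitempty"`
}

// ResourceScalingGroupStatus is a hypothetical, trimmed-down status.
type ResourceScalingGroupStatus struct {
	// Replicas is the observed count reported back through /scale.
	Replicas int32 `json:"replicas"`
	// LabelSelector is the serialized selector string that HPA reads to
	// find the pods behind this resource.
	LabelSelector string `json:"labelSelector,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:subresource:scale:specpath=.spec.replicas,statuspath=.status.replicas,selectorpath=.status.labelSelector
type ResourceScalingGroup struct {
	Spec   ResourceScalingGroupSpec   `json:"spec,omitempty"`
	Status ResourceScalingGroupStatus `json:"status,omitempty"`
}

// desiredReplicas returns the requested count, defaulting to 1 when unset.
// The default is an assumption, not something the KEP specifies.
func desiredReplicas(rsg ResourceScalingGroup) int32 {
	if rsg.Spec.Replicas == nil {
		return 1
	}
	return *rsg.Spec.Replicas
}

func main() {
	n := int32(3)
	rsg := ResourceScalingGroup{Spec: ResourceScalingGroupSpec{Replicas: &n}}
	fmt.Println("desired replicas:", desiredReplicas(rsg))
}
```

If the KEP takes this route, an HPA `scaleTargetRef` could point directly at the RSG object, which would also answer who increments `replicas` in GroupReplication mode.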
Missing Sections
- Test Plan (Unit, Integration, E2E) — all empty. At minimum, outline what the key test scenarios are.
- Alternatives — what other approaches were considered? Why not extend RBGSA instead of introducing a new CRD?
- Graduation criteria and timeline are absent.
Deletion Semantics
- Orphan handling only covers InplaceScaling (Retain). What happens in GroupReplication mode when RSG is deleted — are cloned resources garbage collected via OwnerReferences, or also retained?
Minor
- Both files are missing a trailing newline.
- The KEP file has spaces in its name (`KEP-98 ResourceScalingGroup...`), which can cause issues with some tooling. Consider using hyphens.
- The webhook validation for circular dependency prevention is mentioned but not specified. At least outline the algorithm (topological sort on the binding graph).
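As a starting point for the circular dependency check, here is a minimal sketch of Kahn's topological sort over a binding graph. The `Binding` type with `Source` and `Target` fields is a hypothetical simplification of whatever the KEP's binding schema ends up being.

```go
package main

import "fmt"

// Binding is a hypothetical edge: Target's replica count is derived from
// Source's replica count.
type Binding struct {
	Source string
	Target string
}

// hasCycle reports whether the binding graph contains a cycle, using Kahn's
// algorithm: repeatedly remove nodes with no incoming edges. If any nodes
// are left over, they must lie on or behind a cycle.
func hasCycle(bindings []Binding) bool {
	indegree := map[string]int{}
	adj := map[string][]string{}
	for _, b := range bindings {
		adj[b.Source] = append(adj[b.Source], b.Target)
		indegree[b.Target]++
		if _, ok := indegree[b.Source]; !ok {
			indegree[b.Source] = 0
		}
	}

	// Seed the queue with every node that nothing depends on.
	var queue []string
	for node, d := range indegree {
		if d == 0 {
			queue = append(queue, node)
		}
	}

	visited := 0
	for len(queue) > 0 {
		node := queue[0]
		queue = queue[1:]
		visited++
		for _, next := range adj[node] {
			indegree[next]--
			if indegree[next] == 0 {
				queue = append(queue, next)
			}
		}
	}
	return visited != len(indegree)
}

func main() {
	ok := []Binding{{Source: "prefill", Target: "decode"}}
	bad := append(ok, Binding{Source: "decode", Target: "prefill"})
	fmt.Println("acyclic bindings have cycle:", hasCycle(ok))
	fmt.Println("mutual bindings have cycle:", hasCycle(bad))
}
```

A validating webhook could run a check like this on create and update and reject the object when it returns true.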
Overall the design has legs, but there's enough underspecification that it's hard to evaluate feasibility or review an implementation against this. Looking forward to a revision that fills in the gaps.
Ⅰ. Motivation
Ⅱ. Modifications
Ⅲ. Does this pull request fix one issue?
fixes #XXXX
Ⅳ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.
Ⅴ. Describe how to verify it
VI. Special notes for reviews
Checklist
make fmt.